More fun with penguins

The graphs below don’t have proper titles, axis labels, legends, etc. Please take care to do this on your own graphs. Throughout this demo we will use the palmerpenguins dataset. To access the data, you will need to install the palmerpenguins package:

install.packages("palmerpenguins")

We load the penguins data in the same way as the previous demos:

library(tidyverse)
library(palmerpenguins)
data(penguins)
head(penguins)
## # A tibble: 6 × 8
##   species island    bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex    year
##   <fct>   <fct>              <dbl>         <dbl>       <int>   <int> <fct> <int>
## 1 Adelie  Torgersen           39.1          18.7         181    3750 male   2007
## 2 Adelie  Torgersen           39.5          17.4         186    3800 fema…  2007
## 3 Adelie  Torgersen           40.3          18           195    3250 fema…  2007
## 4 Adelie  Torgersen           NA            NA            NA      NA <NA>   2007
## 5 Adelie  Torgersen           36.7          19.3         193    3450 fema…  2007
## 6 Adelie  Torgersen           39.3          20.6         190    3650 male   2007
## # … with abbreviated variable names ¹​flipper_length_mm, ²​body_mass_g

Pairs plot with GGally

We will use the GGally package to make pairs plots in R with ggplot figures. You need to install the package:

install.packages("GGally")

Next, we’ll load the package and create a pairs plot of just the continuous variables using ggpairs. The main arguments you have to worry about for ggpairs are data, columns, and mapping:

First, let’s create a pairs plot by specifying columns as the four columns of continuous variables (columns 3 through 6):

library(GGally)
penguins %>% ggpairs(columns = 3:6)

Obviously this suffers from over-plotting so we’ll want to adjust the alpha. An annoying thing is that we specify the alpha directionly with aes when using ggpairs:

penguins %>% ggpairs(columns = 3:6, mapping = aes(alpha = 0.5))

Plots along the diagonal show marginal distributions. Plots along the off-diagonal show joint (pairwise) distributions or statistical summaries (e.g., correlation) to avoid redundancy. The matrix of plots is symmetric; e.g., entry (1,2) shows the same distribution as entry (2,1). However, entry (1,2) and entry (2,1) display different bits of information (or alternative plots) about the same distribution.

We could also specify categorical variables in the plot. We also don’t need to specify column indices if we just select which columns to use beforehand:

penguins %>% 
  dplyr::select(bill_length_mm, body_mass_g, species, island) %>%
  ggpairs(mapping = aes(alpha = 0.5))

Alternatively, we can use the mapping argument to display these categorical variables in a different manner - and arguably more efficiently:

penguins %>% 
  ggpairs(columns = c("bill_length_mm", "body_mass_g", "island"),
          mapping = aes(alpha = 0.5, color = species))

The ggpairs function in GGally is very flexible and customizable with regards to which figures are displayed in the various panels. I encourage you to check out the vignettes and demos on the package website for more examples. For instance, in the pairs plot below I decide to display the regression lines and make other adjustments to the off-diagonal figures:

penguins %>% 
  ggpairs(columns = c("bill_length_mm", "body_mass_g", "island"),
          mapping = aes(alpha = 0.5, color = species), 
          lower = list(
            continuous = "smooth_lm", 
            combo = "facetdensitystrip"
          ),
          upper = list(
            continuous = "cor",
            combo = "facethist"
          )
  )

You can also proceed to customize the pairs plot in the same manner as ggplot figures:

penguins %>%
  dplyr::select(species, body_mass_g, ends_with("_mm")) %>%
  ggpairs(mapping = aes(color = species, alpha = 0.5),
          columns = c("flipper_length_mm", "body_mass_g",
                      "bill_length_mm", "bill_depth_mm")) +
  scale_colour_manual(values = c("darkorange","purple","cyan4")) +
  scale_fill_manual(values = c("darkorange","purple","cyan4")) +
  theme_bw() +
  theme(strip.text = element_text(size = 7))

Correlograms with ggcorrplot

We can visualize the correlation matrix for the variables in a dataset using the ggcorrplot package. You need to install the package:

install.packages("ggcorrplot")

Next, we’ll load the package and create a correlogram using only the continuous variables. To do this, we first need to compute the correlation matrix for these variables:

penguins_cor_matrix <- penguins %>%
  dplyr::select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) %>%
  cor(use = "complete.obs")
penguins_cor_matrix
##                   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## bill_length_mm         1.0000000    -0.2350529         0.6561813   0.5951098
## bill_depth_mm         -0.2350529     1.0000000        -0.5838512  -0.4719156
## flipper_length_mm      0.6561813    -0.5838512         1.0000000   0.8712018
## body_mass_g            0.5951098    -0.4719156         0.8712018   1.0000000

NOTE: Since there are missing values in the penguins data we need to indicate in the cor() function how to handle missing values using the use argument. By default, the correlations are returned as NA, which is not what we want. Instead, we can change this to only use observations without NA values for the considered columns (see help(cor) for more options).

Now, we can create the correlogram using ggcorrplot() using this correlation matrix:

library(ggcorrplot)
ggcorrplot(penguins_cor_matrix)

There are several ways we can improve this correlogram:

ggcorrplot(penguins_cor_matrix, type = "lower")

ggcorrplot(penguins_cor_matrix, type = "lower", hc.order = TRUE)

ggcorrplot(round(penguins_cor_matrix, digits = 4), 
           type = "lower", hc.order = TRUE, lab = TRUE)

ggcorrplot(penguins_cor_matrix, type = "lower", hc.order = TRUE,
           method = "circle")

You can ignore the Warning message that is displayed - just from the differences in ggplot implementation.

Parallel coordinates plot with GGally

In a parallel coordinates plot, we create an axis for each varaible and align these axes side-by-side, drawing lines between observations from one axis to the next. This can be useful for visualizing structure among both the variables and observations in our dataset. These are useful when working with a moderate number of observations and variables - but can be overwhelming with too many.

We use the ggparcoord() function from the GGally package to make parallel coordinates plots:

penguins %>%
  ggparcoord(columns = 3:6)

There are several ways we can modify this parallel coordinates plot:

penguins %>%
  ggparcoord(columns = 3:6, alphaLines = .2)

penguins %>%
  ggparcoord(columns = 3:6, alphaLines = .2, groupColumn = "species")

penguins %>%
  ggparcoord(columns = 3:6, alphaLines = .2, groupColumn = "species",
             scale = "uniminmax")

penguins %>%
  ggparcoord(columns = 3:6, alphaLines = .2, groupColumn = "species",
             order = c(6, 5, 3, 4))